malicious noise
Is nasty noise actually harder than malicious noise?
Blanc, Guy, Huang, Yizhi, Malkin, Tal, Servedio, Rocco A.
We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions: malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and nasty noise, in which an adversary can arbitrarily corrupt an adversarially chosen subset of examples given to the learner. We consider both the distribution-independent and fixed-distribution settings. Our main results highlight a dramatic difference between these two settings: For distribution-independent learning, we prove a strong equivalence between the two noise models: If a class ${\cal C}$ of functions is efficiently learnable in the presence of $ฮท$-rate malicious noise, then it is also efficiently learnable in the presence of $ฮท$-rate nasty noise. In sharp contrast, for the fixed-distribution setting we show an arbitrarily large separation: Under a standard cryptographic assumption, for any arbitrarily large value $r$ there exists a concept class for which there is a ratio of $r$ between the rate $ฮท_{malicious}$ of malicious noise that polynomial-time learning algorithms can tolerate, versus the rate $ฮท_{nasty}$ of nasty noise that such learning algorithms can tolerate. To offset the negative result for the fixed-distribution setting, we define a broad and natural class of algorithms, namely those that ignore contradictory examples (ICE). We show that for these algorithms, malicious noise and nasty noise are equivalent up to a factor of two in the noise rate: Any efficient ICE learner that succeeds with $ฮท$-rate malicious noise can be converted to an efficient learner that succeeds with $ฮท/2$-rate nasty noise. We further show that the above factor of two is necessary, again under a standard cryptographic assumption.
Fairness, Accuracy, and Unreliable Data
This thesis investigates three areas targeted at improving the reliability of machine learning; fairness in machine learning, strategic classification, and algorithmic robustness. Each of these domains has special properties or structure that can complicate learning. A theme throughout this thesis is thinking about ways in which a'plain' empirical risk minimization algorithm will be misleading or ineffective because of a mis-match between classical learning theory assumptions and specific properties of some data distribution in the wild. The overarching research goal for these related topics is to provide a crisp mathematical model for each learning scenario that exposes different failure modes and makes trade-offs between important metrics explicit in order to provide algorithmic advice or recommendations to practitioners and expose gaps for future research. By tuning our learning algorithms to be more distribution specific in these scenarios, the resulting learned system will exhibit higher utility and avoid catastrophic failure modes. This research is grounded in the theory of machine learning and is fundamentally mathematical in nature, with empirical support when appropriate. Theory is particularly important in these sensitive domains as it is unclear which poor behavior in deployed systems is a natural or benign consequence of a learning system with the underlying distribution,contrasting with problematic but correctable behavior caused by an error in algorithm design or implementation, how to mitigate these issues, or what a successful outcome even looks like in each problem. Theoretical understanding in each domain can help guide best practices and allow for the design of effective, reliable, and robust systems.
Learning large-margin halfspaces with more malicious noise
We describe a simple algorithm that runs in time poly(n, 1/ฮณ, 1/ฮต) and learns an unknown n-dimensional ฮณ-margin halfspace to accuracy 1 ฮต in the presence of malicious noise, when the noise rate is allowed to be as high as ฮ(ฮตฮณ log(1/ฮณ)). Previous efficient algorithms could only learn to accuracy ฮต in the presence of malicious noise of rate at most ฮ(ฮตฮณ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning ฮณ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than ฮ(ฮตฮณ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm.
On the Vulnerability of Fairness Constrained Learning to Malicious Noise
Blum, Avrim, Okoroafor, Princewill, Saha, Aadirupa, Stangl, Kevin
We consider the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data. Konstantinov and Lampert (2021) initiated the study of this question and presented negative results showing there exist data distributions where for several fairness constraints, any proper learner will exhibit high vulnerability when group sizes are imbalanced. Here, we present a more optimistic view, showing that if we allow randomized classifiers, then the landscape is much more nuanced. For example, for Demographic Parity we show we can incur only a $\Theta(\alpha)$ loss in accuracy, where $\alpha$ is the malicious noise rate, matching the best possible even without fairness constraints. For Equal Opportunity, we show we can incur an $O(\sqrt{\alpha})$ loss, and give a matching $\Omega(\sqrt{\alpha})$lower bound. In contrast, Konstantinov and Lampert (2021) showed for proper learners the loss in accuracy for both notions is $\Omega(1)$. The key technical novelty of our work is how randomization can bypass simple "tricks" an adversary can use to amplify his power. We also consider additional fairness notions including Equalized Odds and Calibration. For these fairness notions, the excess accuracy clusters into three natural regimes $O(\alpha)$,$O(\sqrt{\alpha})$ and $O(1)$. These results provide a more fine-grained view of the sensitivity of fairness-constrained learning to adversarial noise in training data.
Realizable Learning is All You Need
Hopkins, Max, Kane, Daniel, Lovett, Shachar, Mahajan, Gaurav
The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust and private learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions or general loss, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g.\ noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.
Sample-Optimal PAC Learning of Halfspaces with Malicious Noise
We study efficient PAC learning of homogeneous halfspaces in $\mathbb{R}^d$ in the presence of malicious noise of Valiant~(1985). This is a challenging noise model and only until recently has near-optimal noise tolerance bound been established under the mild condition that the unlabeled data distribution is isotropic log-concave. However, it remains unsettled how to obtain the optimal sample complexity simultaneously. In this work, we present a new analysis for the algorithm of Awasthi et al.~(2017) and show that it essentially achieves the near-optimal sample complexity bound of $\tilde{O}(d)$, improving the best known result of $\tilde{O}(d^2)$. Our main ingredient is a novel incorporation of a Matrix Chernoff-type inequality to bound the spectrum of an empirical covariance matrix for well-behaved distributions, in conjunction with a careful exploration of the localization schemes of Awasthi et al.~(2017). We further extend the algorithm and analysis to the more general and stronger nasty noise model of Bshouty~et~al. (2002), showing that it is still possible to achieve near-optimal noise tolerance and sample complexity in polynomial time.
Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
Birnbaum, Aharon, Shwartz, Shai S.
Given $\alpha,\epsilon$, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most $(1+\alpha)\,L^*_\gamma + \epsilon$, where $L^*_\gamma$ is the optimal $\gamma$-margin error rate. For $\alpha = 1/\gamma$, polynomial time and sample complexity is achievable using the hinge-loss. For $\alpha = 0$, \cite{ShalevShSr11} showed that $\poly(1/\gamma)$ time is impossible, while learning is possible in time $\exp(\tilde{O}(1/\gamma))$. An immediate question, which this paper tackles, is what is achievable if $\alpha \in (0,1/\gamma)$. We derive positive results interpolating between the polynomial time for $\alpha = 1/\gamma$ and the exponential time for $\alpha=0$. In particular, we show that there are cases in which $\alpha = o(1/\gamma)$ but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.
Learning large-margin halfspaces with more malicious noise
We describe a simple algorithm that runs in time poly(n,1/gamma,1/eps) and learns an unknown n-dimensional gamma-margin halfspace to accuracy 1-eps in the presence of malicious noise, when the noise rate is allowed to be as high as Theta(eps gamma sqrt(log(1/gamma))). Previous efficient algorithms could only learn to accuracy eps in the presence of malicious noise of rate at most Theta(eps gamma). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning gamma-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Theta(eps gamma); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm.